Skip to content

Add range partitioning sqllogictest fixture#22607

Merged
alamb merged 3 commits into
apache:mainfrom
gene-bordegaray:gene.bordegaray/2026/05/range-partitioning-slt-source
May 29, 2026
Merged

Add range partitioning sqllogictest fixture#22607
alamb merged 3 commits into
apache:mainfrom
gene-bordegaray:gene.bordegaray/2026/05/range-partitioning-slt-source

Conversation

@gene-bordegaray

@gene-bordegaray gene-bordegaray commented May 29, 2026

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

This adds a focused sqllogictest fixture for source-provided Range partitioning before changing optimizer behavior. It follows the direction discussed in #21992 and gives later planning PRs stable baselines for current behavior.

What changes are included in this PR?

  • Registers a range_partitioned test table for range_partitioning.slt.
  • Adds a sqllogictest-only source wrapper that reports Range partitioning when range_key is projected, and UnknownPartitioning when it is not.
  • Adds baselines for grouping on the range key, grouping on a non-range key, joining on the range key, and UNION ALL over range-partitioned inputs.

Are these changes tested?

Yes.

  • cargo fmt --all
  • cargo test -p datafusion-sqllogictest --test sqllogictests range_partitioning
  • cargo clippy --all-targets --all-features -- -D warnings

Are there any user-facing changes?

No. This is test-only infrastructure and sqllogictest coverage.

@github-actions github-actions Bot added the sqllogictest SQL Logic Tests (.slt) label May 29, 2026
@gene-bordegaray

Copy link
Copy Markdown
Contributor Author

cc: @NGA-TRAN @gabotechs @alamb

Comment thread datafusion/sqllogictest/src/test_context.rs Outdated
Comment thread datafusion/sqllogictest/test_files/range_partitioning.slt
Comment thread datafusion/sqllogictest/test_files/range_partitioning.slt Outdated
}
}

fn register_range_partitioned_table(ctx: &SessionContext) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 could it be feasable to have a table function instead? just in case this single table is not capable of satisfying all corner cases in the future.

Just food for thought, if you think a hardcoded table is good then let's stick with it (it's actually simpler)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes I think its feasbile and would be useful for future PRs. Ditto to my comment about this being as simple as possible

Comment thread datafusion/sqllogictest/test_files/range_partitioning.slt
@gene-bordegaray gene-bordegaray force-pushed the gene.bordegaray/2026/05/range-partitioning-slt-source branch 2 times, most recently from 366e4dc to d0ede58 Compare May 29, 2026 10:43
@gene-bordegaray gene-bordegaray force-pushed the gene.bordegaray/2026/05/range-partitioning-slt-source branch from d0ede58 to ada1564 Compare May 29, 2026 10:43
@github-actions github-actions Bot added the physical-expr Changes to the physical-expr crates label May 29, 2026

@alamb alamb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gene-bordegaray -- I think this is a good step and could be merged as is. However, I have some suggestions for follow on work/how to make this simpler and more general. Let me know what you think

Thank you @gabotechs and @asolimando for your reviews

Comment thread datafusion/physical-expr/src/partitioning.rs Outdated
Comment thread datafusion/sqllogictest/src/test_context.rs Outdated
}

impl RangePartitionedTable {
fn output_partitioning(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't realize it would take so much ceremony to get partitioned data (aka create an entire DataSource/TableProvider) and I agree with @gabotechs that if we go down this approach it will be hard to generate all the possible cases we would like to test.

SO I would suggest we start by integrate pre-defined partitioning in higher level APIs, starating with FileScanConfig, which I think will be needed eventually to use this feature

I dug around a bit and it seems like it will be straightforward to make FileScanConfig::with_partitioned_by_file_group more general

/// Set whether file groups are organized by partition column values.
///
/// When set to true, the output partitioning will be declared as Hash partitioning
/// on the partition columns.
pub fn with_partitioned_by_file_group(
mut self,
partitioned_by_file_group: bool,
) -> Self {
self.partitioned_by_file_group = partitioned_by_file_group;
self
}

What I suggest we do (could be a follow on PR)

  1. Update FileScanConfig so it specifies a predefined output partitioning (output_partitioning: Option<Partitioning>) rather than a bool
  2. Deprecate FileScanConfig::with_partitoned_by_file_groups
  3. Add a new FileScanConfig::with_output_partitioning similar to FileScanConfig::with_output_ordering
  4. Use the DataSourceExec::from(...) for that config

That will save a lot of boiler plate in this setup, and I think you'll need the more general form to take advantage of range partitioning externally (aka it won't be wasted code)

Eventually, I think we should be targeting ListingTable s that declared it was RangePartitioned via ListingOptions (and thus eventually via SQL) .

However, it seems like ListingTable/ListingOptions don't have the output hooks yet and we need to buid there incrementally
https://github.com/apache/datafusion/blob/4c909bafc5c50749884fdd80a06235d7bd72dbde/datafusion/catalog-listing/src/options.rs#L32-L31

@gene-bordegaray gene-bordegaray May 29, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be a good clean up. I would be ok with doing this in this PR but think it would be better on its own as I think it will have some ripppling effects

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I poke around and will see if it is feasible to add in here

@gene-bordegaray gene-bordegaray May 29, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think that the FileScanConfig::with_output_partitioning(...) with the ListingOptions::with_output_partitioning(...) is the right move. It will help us declaring range partitioning on actual file/listing tables so that is great

I think we can leave the FileScanConfig / ListingTable work as a follow-up. There is some things we need to ahndle there I don't think belong in here.

There is a smaller cleanup where MemorySourceConfig could have an output_partitioning builder, which would clean up a lot of the boilerplate, but after looking at it I don’t think it is good because adding it only to MemorySourceConfig feels like a one-off public API just for this fixture.

I would be ok with moving this to its own module now and clean up with the right fix after.

Comment thread datafusion/sqllogictest/test_files/range_partitioning.slt
Comment thread datafusion/sqllogictest/test_files/range_partitioning.slt
);
}

// ==============================================================================

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we really need a few hundred lines of setip, maybe we can put it into a module like datafusion/sqllogictest/src/test_context/range_partitioning.rs

However, I have suggestions below that I think could make this substantially smaller

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up doing this with the justification in comment on refactoring. If we are ok with this path I can open an issue 😄

@alamb alamb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thank you @gene-bordegaray - I think this is a good start and we can iterate on this / clean it up in follow ons (clearly I trust that you will do so 😆 )

Since it is just test infrastructure I'll put it in the merge queue now

@alamb alamb added this pull request to the merge queue May 29, 2026
Merged via the queue into apache:main with commit b6d4c25 May 29, 2026
39 checks passed
@gene-bordegaray

Copy link
Copy Markdown
Contributor Author

I have made a follow up issue on enhancing the file/listing scan APIs for output_partitioning: #22645

alamb pushed a commit to zfarrell/datafusion that referenced this pull request Jun 23, 2026
…_partitioning` for pre-defined file partitioning (apache#22657)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

- Closes apache#22645.

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

This follows up on apache#22607 by replacing range-partitioning sqllogictest
boilerplate with a general file/listing scan API for declared output
partitioning.

Related: apache#21992, apache#22607,
apache#22607 (comment)

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

- Add declared `output_partitioning` to file scan and listing table
configuration.
- Preserve declared partition counts during listing-table file grouping.
- Serialize scan `output_partitioning` through physical plan proto.
- Refactor `range_partitioning.slt` to use a CSV `ListingTable` instead
of a custom test-only `TableProvider` / `DataSource`.

Contract:
- Declared partitioning expressions are written against the full table
schema before scan projection. For example, `Range([range_key@0], [(10),
(20)], 3)` remains valid if the scan projects `range_key` and falls back
to `UnknownPartitioning(3)` if `range_key` is not projected.
- Listing tables create one file group per declared output partition
(which can exceed `target_partitions`). It is up to the user to plan
their partitioning. For example, a 4-partition range declaration creates
four scan file groups, adding empty trailing groups when fewer files are
present.
- File group index is part of the contract: file group `i` must contain
rows for declared output partition `i`. DataFusion does not validate row
placement, matching other user-declared properties such as sortedness.

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

Yes.

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

Yes. This adds public API for declaring file/listing scan output
partitioning. No breaking API changes.

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

---------

Co-authored-by: Gabriel <45515538+gabotechs@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants